Presidential Speeches

Topical and Lexical Similarity

Harman Singh

John Cabot University

Background

  • Much criticism of Donald Trump has centered on him being unfit for the Office of President of the United States, or his “unpresidentiality”.

  • Some of this criticism stems from Donald Trump’s rhetorical style, which has likewise been deemed “unpresidential”.

  • But what does “Presidentiality” mean? Are there traits, character qualities, rhetorical styles, or other elements that US Presidents share?

Research Question

  • Have US Presidents throughout history given speeches and official addresses that resemble one another?

  • Which topics have they spoken about most often, and how have those topics changed over time?

Data Sources

  • The Miller Center at the University of Virginia’s ‘Presidential Speeches’ collection makes speeches from George Washington to the present available in text format

  • Most speeches are official addresses, remarks, or statements

  • Freely available to the public for download in JSON format

  • The collection is not exhaustive, but it is extensive and contains over 1,000 speeches

Methodology

Selecting the Data

  • Temporal shifts in American society; Realignment in American domestic/foreign policy

  • Only Presidents from the 20th century onward, starting with Theodore Roosevelt (1901)

  • Only speeches given while in office, excluding campaign and other speeches, for consistency

  • Captures the historical trend of topics still relevant today while leaving out archaic ones (slavery, railroads, etc.)

  • Minimal cleaning to maintain semantic and contextual coherence for my models
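
The "only speeches while President" rule can be sketched as a date-range check. This is a minimal illustration, not the memo's pipeline: the `TERMS` table, the `in_office` helper, and the toy records are hypothetical, and each president maps to a list of ranges on the assumption that non-consecutive terms need to be represented.

```python
from datetime import date

# Hypothetical terms table: each president maps to a list of (start, end)
# ranges so that non-consecutive terms can be represented.
TERMS = {
    "Barack Obama": [(date(2009, 1, 20), date(2017, 1, 20))],
    "Donald Trump": [
        (date(2017, 1, 20), date(2021, 1, 20)),
        (date(2025, 1, 20), date(2025, 4, 27)),  # end = assumed data cutoff
    ],
}

def in_office(president, speech_date):
    """Return True if speech_date falls inside one of the president's terms."""
    return any(start <= speech_date <= end
               for start, end in TERMS.get(president, []))

# Toy records for illustration only.
speeches = [
    {"president": "Barack Obama", "date": date(2008, 8, 28)},  # before taking office
    {"president": "Barack Obama", "date": date(2013, 1, 21)},  # in office
    {"president": "Donald Trump", "date": date(2019, 2, 5)},   # in office
]

kept = [s for s in speeches if in_office(s["president"], s["date"])]
print(len(kept))
```

Presidents missing from the table are treated as never in office, so their speeches are dropped rather than raising an error.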

BERTopic

  • BERTopic groups similar speeches by their language patterns, identifying key themes automatically with the help of advanced language models

  • Looks at context of words in relation to each other to find and build common topics

  • Top five topics
  • Topics spoken about by US Presidents are temporally contingent
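
One way to read "temporally contingent" is as each topic's share of a year's speech chunks. The sketch below computes that share on made-up `(year, topic)` pairs; the data and variable names are illustrative assumptions, and the topic assignment itself (the BERTopic step) is taken as given.

```python
from collections import Counter, defaultdict

# Toy (year, topic) pairs, one per speech chunk; made up for illustration.
chunks = [
    (1966, "Vietnam War"), (1966, "Vietnam War"), (1966, "Health Care"),
    (1994, "Health Care"), (1994, "Health Care"), (1994, "Peace, Nations, War"),
    (1994, "Health Care"),
]

totals = Counter(year for year, _ in chunks)  # number of chunks per year
counts = Counter(chunks)                      # number of chunks per (year, topic)

# Share of each year's chunks that fall in each topic.
share = defaultdict(dict)
for (year, topic), n in counts.items():
    share[year][topic] = n / totals[year]

for year in sorted(share):
    print(year, share[year])
```

Dividing by the yearly total keeps years with many speeches from dominating the trend.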

Cosine Similarity using Word Embeddings

  • Measures how similar speeches are by comparing them in vector space

  • Uses word embeddings to capture semantic meaning, not just keywords

  • Helps identify subtle language patterns and thematic connections
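
As a rough illustration of the comparison in vector space, the sketch below averages a few toy vectors per president and takes the cosine of the angle between the averages. The three-dimensional vectors and the helper names are assumptions for the example; real sentence embeddings have hundreds of dimensions.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine of the angle between u and v: dot product over the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings": one list of chunk vectors per president.
president_a = mean_vector([[1.0, 0.0, 1.0], [1.0, 2.0, 1.0]])  # mean is [1, 1, 1]
president_b = mean_vector([[2.0, 2.0, 2.0]])                   # same direction as A
president_c = mean_vector([[1.0, -1.0, 0.0]])                  # orthogonal to A

print(cosine(president_a, president_b))
print(cosine(president_a, president_c))
```

Because cosine similarity depends only on direction, A and B score as near-identical even though B's vector is twice as long.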

Cosine Similarity using TF-IDF

  • Compares speeches based on word frequency adjusted by overall rarity

  • Effective for spotting shared vocabulary across texts

  • Less suited to capturing semantic meaning than the embedding approach

  • Gerald Ford: 0.56; Donald Trump: 0.62
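
The idea of "word frequency adjusted by overall rarity" can be sketched with a hand-rolled TF-IDF. This is a simplified illustration on three made-up sentences, not the weighting used for the figures above; the smoothed idf formula and all names are assumptions for the example.

```python
import math
from collections import Counter

# Toy documents, made up for illustration.
docs = {
    "A": "we must defend freedom and peace",
    "B": "we will defend peace and freedom abroad",
    "C": "the budget must cut taxes and spending",
}

tokens = {name: text.split() for name, text in docs.items()}
vocab = sorted({w for words in tokens.values() for w in words})
n_docs = len(docs)

# Document frequency: in how many documents each word appears.
doc_freq = Counter(w for words in tokens.values() for w in set(words))

def tfidf(words):
    """Term counts scaled by a smoothed idf, log((1 + N) / (1 + df)) + 1."""
    tf = Counter(words)
    return [tf[w] * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1) for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

vectors = {name: tfidf(words) for name, words in tokens.items()}
sim_ab = cosine(vectors["A"], vectors["B"])
sim_ac = cosine(vectors["A"], vectors["C"])
print(round(sim_ab, 2), round(sim_ac, 2))
```

A and B share most of their vocabulary and score higher than A and C, which overlap only on "must" and the ubiquitous "and"; a word that appears in every document receives the lowest weight.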

Conclusion

  • First, topic prevalence in Presidential speeches is time dependent: it fluctuates with changes in the global or domestic political sphere.

  • Second, in both content and vocabulary, Presidents from Coolidge through Clinton resemble one another, after which there is a break and a new set of similarities begins.

  • Third, while Donald Trump’s rhetorical choices have been criticized as a great break from previous Presidents, this is only verifiable in terms of topic/thematic consistency and not in terms of vocabulary.

Appendix

This is a technical appendix for the operations performed to create this memo.

Part 1: Loading the Data

import json
import pandas as pd

# Load the Miller Center export and flatten it into one row per speech.
with open('speeches.json', 'r') as file:
  speeches = json.load(file)
df = pd.json_normalize(speeches)

Part 2: Cleaning and Organizing the Text

keep_presidents = [
    "Theodore Roosevelt", "William Taft", "Woodrow Wilson",
    "Warren G. Harding", "Calvin Coolidge", "Herbert Hoover",
    "Franklin D. Roosevelt", "Harry S. Truman", "Dwight D. Eisenhower",
    "John F. Kennedy", "Lyndon B. Johnson", "Richard M. Nixon",
    "Gerald Ford", "Jimmy Carter", "Ronald Reagan",
    "George H. W. Bush", "Bill Clinton", "George W. Bush",
    "Barack Obama", "Donald Trump", "Joe Biden"
]

df_new = df[df["president"].isin(keep_presidents)]

df_new = df_new.drop(['doc_name', 'title'], axis=1)

df_new.sort_values("date", inplace=True)
df_new.reset_index(drop=True, inplace=True)
# Terms of office as (start, end) date strings. Each value is a list because a
# president can serve non-consecutive terms and a dict cannot repeat a key.
president_terms = {
    "Theodore Roosevelt": [("1901-09-14", "1909-03-04")],
    "William Taft": [("1909-03-04", "1913-03-04")],
    "Woodrow Wilson": [("1913-03-04", "1921-03-04")],
    "Warren G. Harding": [("1921-03-04", "1923-08-02")],
    "Calvin Coolidge": [("1923-08-02", "1929-03-04")],
    "Herbert Hoover": [("1929-03-04", "1933-03-04")],
    "Franklin D. Roosevelt": [("1933-03-04", "1945-04-12")],
    "Harry S. Truman": [("1945-04-12", "1953-01-20")],
    "Dwight D. Eisenhower": [("1953-01-20", "1961-01-20")],
    "John F. Kennedy": [("1961-01-20", "1963-11-22")],
    "Lyndon B. Johnson": [("1963-11-22", "1969-01-20")],
    "Richard M. Nixon": [("1969-01-20", "1974-08-09")],
    "Gerald Ford": [("1974-08-09", "1977-01-20")],
    "Jimmy Carter": [("1977-01-20", "1981-01-20")],
    "Ronald Reagan": [("1981-01-20", "1989-01-20")],
    "George H. W. Bush": [("1989-01-20", "1993-01-20")],
    "Bill Clinton": [("1993-01-20", "2001-01-20")],
    "George W. Bush": [("2001-01-20", "2009-01-20")],
    "Barack Obama": [("2009-01-20", "2017-01-20")],
    "Donald Trump": [("2017-01-20", "2021-01-20"), ("2025-01-20", "2025-04-27")],
    "Joe Biden": [("2021-01-20", "2025-01-20")],
}
df_new['date'] = pd.to_datetime(df_new['date'], format='ISO8601', utc=True, errors='coerce')
df_new = df_new.dropna(subset=['date'])
df_new['date'] = df_new['date'].dt.date
# Convert the term boundaries from strings to datetime.date objects.
president_terms = {
    pres: [
        (pd.to_datetime(start).date(),
         pd.to_datetime(end).date() if end else pd.Timestamp.today().date())
        for start, end in terms
    ]
    for pres, terms in president_terms.items()
}
def was_president_at_time(row):
    # True if the speech date falls within any of the speaker's terms of office.
    terms = president_terms.get(row['president'], [])
    return any(start <= row['date'] <= end for start, end in terms)


df_proper = df_new[df_new.apply(was_president_at_time, axis=1)].reset_index(drop=True)

Part 3: BERTopic Analysis

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
def clean_text(text):
    text = text.replace('\n', ' ')
    return text.strip()
df_proper['cleaned_text'] = df_proper['transcript'].apply(clean_text)
def chunk_text(text, max_words=300):
    words = text.split()
    return [' '.join(words[i:i+max_words]) for i in range(0, len(words), max_words)]

df_proper['chunks'] = df_proper['cleaned_text'].apply(chunk_text)

docs_chunked = [chunk for chunks in df_proper['chunks'] for chunk in chunks]
df_proper = df_proper.reset_index(drop=True)  
df_proper['speech_id'] = df_proper.index + 1
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import contextlib
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP
umap_model = UMAP(random_state=42)

#embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
#vectorizer_model = CountVectorizer(stop_words="english")

topic_model = BERTopic(
  #embedding_model=embedding_model,
  #vectorizer_model=vectorizer_model,
  calculate_probabilities=True,
  verbose=False,
  umap_model = umap_model,
  #top_n_words=7,
  #nr_topics="auto",
)

topics, probs = topic_model.fit_transform(docs_chunked)
topic_info = topic_model.get_topic_info()[['Topic', 'Count', 'Name', 'Representation']]

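# Copy the column subset so the assignment below does not write to a view of topic_info.
topic_info_simple = topic_info[["Topic", 'Count', "Representation"]].copy()

# Join each topic's keywords into one string, dropping repeated words while keeping their order.
topic_info_simple['Representation'] = topic_info_simple['Representation'].apply(lambda x: ' '.join(dict.fromkeys(x)))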

frequency_table = pd.DataFrame(topic_info_simple)
chunked_data = []

for idx, row in df_proper.iterrows():
    chunks = chunk_text(row['cleaned_text'], max_words=300)
    for chunk in chunks:
        chunked_data.append({
            "original_speech_id": row['speech_id'],
            "president": row['president'],
            "date": row['date'],
            "transcript": chunk
        })


df_chunked = pd.DataFrame(chunked_data)
df_chunked['topic'] = topics
excluded_topics = [-1]

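def reassign_topic(topic, prob_row):
    # Chunks that BERTopic labels as outliers (-1) get the most probable regular topic instead.
    if topic in excluded_topics:
        sorted_indices = np.argsort(prob_row)[::-1]  # topic indices, most probable first
        for idx in sorted_indices:
            if idx not in excluded_topics:
                return idx
        return topic  
    else:
        return topic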

df_chunked["topic"] = [
    reassign_topic(t, p) for t, p in zip(df_chunked["topic"], probs)
]
topic_labels = {
    row["Topic"]: row["Representation"] 
    for _, row in topic_info_simple.iterrows()
}

df_chunked["topic_label"] = df_chunked["topic"].map(topic_labels)
df_chunked['year'] = pd.to_datetime(df_chunked['date']).dt.year
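
The outlier reassignment above can be illustrated on a toy probability matrix. This is a plain-Python sketch under the assumption that column i of each probability row corresponds to topic i and that -1 marks an outlier; the `reassign_outlier` helper and the numbers are made up for the example.

```python
def reassign_outlier(topic, prob_row):
    """Replace the outlier label -1 with the index of the most probable topic."""
    if topic != -1:
        return topic
    return max(range(len(prob_row)), key=lambda i: prob_row[i])

# Toy labels and per-topic probabilities (columns are topics 0, 1, 2).
topics = [0, -1, 2, -1]
probs = [
    [0.70, 0.20, 0.10],
    [0.10, 0.60, 0.30],
    [0.05, 0.15, 0.80],
    [0.40, 0.35, 0.25],
]

reassigned = [reassign_outlier(t, p) for t, p in zip(topics, probs)]
print(reassigned)  # [0, 1, 2, 0]
```

Forcing every chunk into a topic keeps the yearly totals complete, at the cost of assigning some chunks to topics they fit only weakly, as with the last row here.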

Data Visualization

top_5_topics = df_chunked['topic'].value_counts().head(5).index
df_top_5 = df_chunked[df_chunked['topic'].isin(top_5_topics)]
df_count_by_year = df_top_5.groupby(['year', 'topic']).size().reset_index(name='count')

df_total_by_year = df_chunked.groupby('year').size().reset_index(name='total')
df_count_by_year = pd.merge(df_count_by_year, df_total_by_year, on='year')

df_count_by_year['proportion'] = df_count_by_year['count'] / df_count_by_year['total']
df_count_by_year['topic_labels'] = df_count_by_year["topic"].map(topic_labels)
manual_labels = {
  0: "Vietnam War",
  1: "Health Care",
  7: "Banks, Credit, Gold",
  2: "Peace, Nations, War",
  3: "Rights, Blacks, White",
  
}

df_count_by_year['manual_labels'] = df_count_by_year["topic"].map(manual_labels)
library(ggplot2)
count_by_year <- reticulate::py$df_count_by_year

ggplot(count_by_year, aes(x = year, y = proportion, fill = manual_labels)) +
  geom_area() +
  theme_minimal() +
  labs(title = "Top 5 Topics in Presidential Speeches Over Time",
       x = "Year",
       y = "Proportion of Speeches",
       fill = "Topic") +
  scale_fill_viridis_d() +
  theme(plot.title = element_text(size = 12,face='bold'),
        legend.position = "bottom",
        legend.text = element_text(size = 6))
ggplot(count_by_year, aes(x = year, y = proportion)) +
  geom_line(aes(color = manual_labels)) +
  facet_wrap(~ manual_labels, scales = "free_y") +  
  theme_minimal() +
  labs(title = "Top 5 Topics in Presidential Speeches Over Time",
       x = "Year",
       y = "Proportion of Speeches",
       color = "Topic") +
  scale_color_viridis_d() +
  theme(
    legend.position = "none",
    plot.title = element_text(size = 10,face='bold')
  )

Part 4: Cosine Similarity

Word Embedding

topics, probs = topic_model.transform(df_chunked["transcript"].tolist())
embeddings = topic_model._extract_embeddings(df_chunked["transcript"].tolist(), method="document")
  
df_chunked["embedding"] = list(embeddings)
president_embedding = df_chunked.groupby("president")["embedding"].apply(
    lambda emb_list: np.mean(np.vstack(emb_list), axis=0)
    )
from sklearn.metrics.pairwise import cosine_similarity

X = np.vstack(president_embedding.values)
cosine_similarity_matrix = cosine_similarity(X)
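# groupby sorts presidents alphabetically, so label the matrix in that order first.
presidents = president_embedding.index.tolist()
similarity_df = pd.DataFrame(cosine_similarity_matrix, index=presidents, columns=presidents)

# Then reorder rows and columns chronologically, skipping anyone without speeches.
chronological = [p for p in keep_presidents if p in presidents]
similarity_df = similarity_df.loc[chronological, chronological]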

Word Embedding Visualization

library(reshape2)
library(viridis)


similarity_matrix <- reticulate::py$similarity_df
similarity_matrix <- as.matrix(similarity_matrix)
melted_matrix <- melt(similarity_matrix, varnames = c("president_1", "president_2"))

ggplot(melted_matrix, aes(x = president_1, y = president_2, fill = value)) +
  geom_tile(color = "white", linewidth = 0.3) +  
  scale_fill_viridis(
    option = "viridis",  # Try "magma", "plasma", or "inferno" for other variants
    direction = -1,     
    limits = c(min(melted_matrix$value), max(melted_matrix$value)) 
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 6),
    axis.text.y = element_text(size = 10),
    legend.position = "right",
    plot.title = element_text(size = 10,face='bold')  

  ) +
  labs(
    x = "President",
    y = "President",
    title = "Word Embedding - Cosine Similarity of Presidential Speeches",
    fill = "Similarity"
  ) +
  coord_fixed()  

TF-IDF Cosine Similarity

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)
presidents_aggregated = df_proper.groupby('president')['cleaned_text'].apply(" ".join).reset_index()

presidents_aggregated['cleaned_text'] = presidents_aggregated['cleaned_text'].apply(remove_stopwords)

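# Weight terms by TF-IDF (TfidfVectorizer is imported above) rather than by raw counts.
vectorizer = TfidfVectorizer()
president_dfm = vectorizer.fit_transform(presidents_aggregated['cleaned_text'])
pres_cosine = cosine_similarity(president_dfm, president_dfm)

# groupby sorted the presidents alphabetically, so label rows in that order,
# then reorder rows and columns chronologically.
tfidf_presidents = presidents_aggregated['president'].tolist()
similarity_df_2 = pd.DataFrame(pres_cosine, index=tfidf_presidents, columns=tfidf_presidents)
chronological_2 = [p for p in keep_presidents if p in tfidf_presidents]
similarity_df_2 = similarity_df_2.loc[chronological_2, chronological_2]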

TF-IDF Visualization

similarity_matrix_2 <- reticulate::py$similarity_df_2
similarity_matrix_2 <- as.matrix(similarity_matrix_2)
melted_matrix_2 <- melt(similarity_matrix_2, varnames = c("president_1", "president_2"))


ggplot(melted_matrix_2, aes(x = president_1, y = president_2, fill = value)) +
  geom_tile(color = "white", linewidth = 0.3) +  
  scale_fill_viridis(
    option = "viridis",  # Try "magma", "plasma", or "inferno" for other variants
    direction = -1,     
    limits = c(min(melted_matrix_2$value), max(melted_matrix_2$value)) 
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 6),
    axis.text.y = element_text(size = 10),
    legend.position = "right",
    plot.title = element_text(size = 10, face='bold')  

  ) +
  labs(
    x = "President",
    y = "President",
    title = "TF-IDF Cosine Similarity of Presidential Speeches",
    fill = "Similarity"
  ) +
  coord_fixed()